feat: add JoyImage edit plus #14032

Open
tangyanf wants to merge 17 commits into huggingface:main from tangyanf:add-joyimage-edit-plus

tangyanf commented Jun 22, 2026

Description

We are the JoyAI Team, and this is the Diffusers implementation for the JoyAI-Image-Edit-Plus model.

GitHub Repository: https://github.com/jd-opensource/JoyAI-Image
Hugging Face Model: https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers
Original open-source weights: https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus
Fixes #14049

Model Overview

JoyAI-Image-Edit-Plus extends JoyAI-Image-Edit with multi-image editing capabilities. While JoyAI-Image-Edit operates on a single reference image, Edit-Plus accepts multiple reference
images as input and performs instruction-guided editing across them — enabling tasks such as subject composition, style transfer from multiple sources, and multi-view consistent editing.

It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT), supporting variable-resolution reference images that are independently
encoded and jointly denoised.

Key Features

  • Multi-Image Input: Accepts multiple reference images with different resolutions, enabling complex editing scenarios that require information from multiple visual sources.
  • Subject Composition: Combine elements from separate images into a coherent output guided by text instructions (e.g., "Let the person lovingly play with the dog" given separate person
    and dog images).
  • Cross-Image Style Transfer: Apply style or attributes from one reference image to subjects in another.
  • Variable Resolution Support: Each reference image is independently resized and encoded at its optimal resolution, preserving fine-grained details regardless of input size.
  • Instruction-Guided Generation: Natural language prompts control how multiple reference images are composed and edited in the final output.

@github-actions github-actions Bot added models pipelines size/L PR with diff > 200 LOC labels Jun 22, 2026
@github-actions

Contributor

Hi @tangyanf, thanks for the PR! It does not appear to link an issue it fixes. If this PR addresses an existing issue, please add a closing keyword (e.g. Fixes #1234) to the PR description so the issue is linked. See the contribution guide for more details. If this PR intentionally does not fix a tracked issue, a maintainer can add the no-issue-needed label to silence this reminder.

@yiyixuxu yiyixuxu added the no-issue-needed for PRs that do not require link to an issue label Jun 22, 2026
sergereview[bot]
sergereview Bot previously requested changes Jun 22, 2026

@sergereview sergereview Bot left a comment

Contributor

🤗 Serge says:

This PR adds the JoyImage Edit Plus model and pipeline. There are several blocking issues that need to be addressed before merging.

Blocking — Debug artifacts left in production code

Multiple torch.save() calls, a print() statement, and a commented-out exit(0) are left in pipeline_joyimage_edit_plus.py. These will write files to the user's working directory and print to stdout during every inference call.

Blocking — einops dependency

Per .ai/models.md: "No new mandatory dependency without discussion (e.g. einops). Optional deps guarded with is_X_available() and a dummy in utils/dummy_*.py." The pipeline directly imports from einops import rearrange — this is the only non-comment usage of einops in src/diffusers/. The rearrange calls should be rewritten with native PyTorch (reshape, permute, unflatten).
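
As a rough illustration of the kind of rewrite requested here, the sketch below expresses a patchify-style `rearrange` pattern with `reshape` and `permute` only. The pattern string, the tensor shapes, and the helper name `patchify` are assumptions for illustration; the actual `rearrange` calls in the pipeline are not shown in this thread.

```python
import torch


def patchify(x: torch.Tensor, p: int) -> torch.Tensor:
    """Native PyTorch equivalent of the einops pattern
    rearrange(x, "b c (h p1) (w p2) -> b (h w) (c p1 p2)", p1=p, p2=p)."""
    b, c, height, width = x.shape
    h, w = height // p, width // p
    x = x.reshape(b, c, h, p, w, p)        # split H into (h, p1) and W into (w, p2)
    x = x.permute(0, 2, 4, 1, 3, 5)        # reorder to b, h, w, c, p1, p2
    return x.reshape(b, h * w, c * p * p)  # merge into (h w) and (c p1 p2)


# Illustrative input: batch 2, 3 channels, 4x6 spatial grid, 2x2 patches.
x = torch.arange(2 * 3 * 4 * 6, dtype=torch.float32).reshape(2, 3, 4, 6)
tokens = patchify(x, p=2)
```

Each output token holds one 2x2 patch across all channels, so `tokens` has one row per patch position.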

Blocking — sglang integration code in model forward

The transformer's forward method contains sglang-specific code: list-unwrapping for "SglangXvideo CFG branches" (lines 272-276) and a try: from sglang... fallback (lines 279-287). Per .ai/AGENTS.md: "No defensive code, unused code paths, or legacy stubs — do not add fallback paths, safety checks, or configuration options 'just in case'." This code doesn't belong in the diffusers model — the pipeline always passes the required arguments.

Blocking — Missing dummy objects

JoyImageEditPlusTransformer3DModel, JoyImageEditPlusPipeline, and JoyImageEditPlusPipelineOutput are not registered in dummy_pt_objects.py / dummy_torch_and_transformers_objects.py. This will cause ImportError when torch/transformers are not installed.
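
To make the purpose of dummy objects concrete, here is a simplified, stand-alone sketch of the idea: a placeholder class that can always be imported and only raises once someone tries to use it. `require_backends_sketch`, `PlaceholderPipeline`, and the backend name are hypothetical names for this illustration; they are not the helpers diffusers actually uses.

```python
import importlib.util


def require_backends_sketch(obj, backends):
    """Simplified stand-in for a backend check: raise if a backend is missing."""
    name = getattr(obj, "__name__", obj.__class__.__name__)
    missing = [b for b in backends if importlib.util.find_spec(b) is None]
    if missing:
        raise ImportError(f"{name} requires backends that are not installed: {missing}")


class PlaceholderPipeline:
    """Importable placeholder that only fails when someone tries to use it."""

    # Hypothetical backend name, chosen so the check fails on any machine.
    _backends = ["not_a_real_backend_for_this_sketch"]

    def __init__(self, *args, **kwargs):
        require_backends_sketch(self, self._backends)


# Defining (or importing) the class is fine; instantiating it raises.
try:
    PlaceholderPipeline()
    error_message = None
except ImportError as err:
    error_message = str(err)
```

The point of the pattern is that a top-level import keeps working without the optional backends, and the user gets a clear error only at the point of use.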

Blocking — Missing tests

No test files were added for the new model or pipeline.

Blocking — Hardcoded device_type="cuda" in torch.autocast

torch.autocast(device_type="cuda", ...) is hardcoded in two places in the pipeline. This will fail on MPS, XPU, and other non-CUDA devices.
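
A minimal sketch of the usual fix, assuming the autocast region wraps a module call on a tensor whose device is already known. The helper name `run_in_autocast` and the `Linear` stand-in are illustrative only; which dtypes autocast supports on a given backend depends on the PyTorch version.

```python
import torch


def run_in_autocast(module: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Take the autocast backend from the tensor's device instead of hardcoding "cuda".
    with torch.autocast(device_type=x.device.type, dtype=torch.bfloat16):
        return module(x)


layer = torch.nn.Linear(4, 4)  # stand-in for a pipeline component
out = run_in_autocast(layer, torch.randn(2, 4))
```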

Non-blocking — Inlined scheduler sigma math

Per .ai/pipelines.md gotcha #3, the pipeline manually computes shifted sigmas and temporarily overrides self.scheduler.shift — this is exactly what FlowMatchEulerDiscreteScheduler does with its shift config. The scheduler should own this logic.
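
For reference, the static shift that a flow-matching scheduler applies is a simple rational transform of each sigma. The sketch below shows that transform in plain Python; the function name `shift_sigmas`, the 4-step sigma grid, and `shift=3.0` are assumptions for illustration, and the formula is given as I understand the scheduler's non-dynamic shift, not as a copy of this PR's code.

```python
def shift_sigmas(sigmas, shift):
    """Apply sigma' = shift * sigma / (1 + (shift - 1) * sigma) to each sigma."""
    return [shift * s / (1 + (shift - 1) * s) for s in sigmas]


num_steps = 4
base_sigmas = [1 - i / num_steps for i in range(num_steps)]  # 1.0, 0.75, 0.5, 0.25
shifted = shift_sigmas(base_sigmas, shift=3.0)               # 1.0, 0.9, 0.75, 0.5
```

With `shift > 1`, the intermediate sigmas move toward 1, so more of the steps are spent at high noise levels. Configuring the scheduler with the shift avoids repeating this math in the pipeline.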

Non-blocking — Unused imports and parameters

  • import inspect in transformer_joyimage_edit_plus.py is unused.
  • enable_denormalization parameter is declared in prepare_latents and __call__ but never read.
  • retrieve_timesteps is duplicated from the existing pipeline without a # Copied from annotation.

serge v0.1.0 · model: claude-opus-4-6 · 29 LLM turns · 50 tool calls · 190.2s · 1602502 in / 7369 out tokens

Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/models/transformers/transformer_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/pipelines/joyimage/pipeline_output.py Outdated
Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py Outdated
tangyanfei.8 added 5 commits June 24, 2026 09:35
    - Remove einops dependency: replace rearrange with reshape/permute
    - Remove sglang-specific code from transformer forward
    - Remove unused import inspect from transformer
    - Fix hardcoded device_type="cuda" to use device.type
    - Simplify scheduler sigma math: delegate to retrieve_timesteps
    - Remove unused enable_denormalization parameter
    - Fix callback latents variable binding
    - Fix output_type="pt" to return stacked tensor
    - Set return_dict default to True in transformer forward
    - Add dummy objects for JoyImageEditPlus classes
    - Add transformer and pipeline test files
@tangyanf tangyanf force-pushed the add-joyimage-edit-plus branch from 6f2763a to 8a911e5 Compare June 24, 2026 09:48
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Jun 24, 2026
@yiyixuxu

Collaborator

@claude can you do a review here?

@github-actions

github-actions Bot commented Jun 24, 2026

Contributor

Claude encountered an error —— View job


I'll analyze this and get back to you.

@yiyixuxu yiyixuxu left a comment

Collaborator

thanks, i left some feedbacks

Comment thread src/diffusers/models/transformers/transformer_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/models/transformers/transformer_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/models/transformers/transformer_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/models/transformers/transformer_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/models/transformers/transformer_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py
Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py Outdated
Comment thread src/diffusers/pipelines/joyimage/pipeline_joyimage_edit_plus.py Outdated
@tangyanf tangyanf requested a review from yiyixuxu June 25, 2026 02:37
@tangyanf

Author

thanks, i left some feedbacks

@yiyixuxu Thank you for taking the time to review this PR! I've addressed all the feedback — here's a summary of the changes:

  1. Removed cross-imports from transformer_joyimage.py; copied and renamed classes with # Copied from annotations.
  2. Removed **kwargs from JoyImageEditPlusAttnProcessor.__call__.
  3. Removed cos.ndim == 2 branch in _apply_rotary_emb_batched (only batched path kept).
  4. Made shape_list a required argument, removed None default and ValueError check.
  5. Removed conditional on vec.unflatten.
  6. Replaced _resize_center_crop with self.vae_image_processor.resize_center_crop().
  7. Replaced _get_bucket_size with self.vae_image_processor.get_default_height_width().
  8. Added # Copied from comment to retrieve_timesteps.

Please let me know if there's anything else that needs to be updated!

@tangyanf

Author

re check

@tangyanf

Author

@tarekziade Hi! I've addressed all the feedback from sergereview[bot] in my latest commits. However, the bot's "changes requested" review is still blocking the merge. Could you help me with one of the following:

  1. Dismiss the bot's review now that the changes have been addressed, or
  2. Let me know how to trigger a re-review from the bot so it can re-evaluate the updated code?

Thanks!

…t_to_diffusers.py

	JoyImage Edit and Edit Plus share identical VAE and transformer weight
      layouts — only the target model class differs. Consolidate both into a
      single script with a --model_type flag (edit | edit_plus) instead of
      maintaining two nearly-duplicate files.
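
A minimal sketch of how such a flag might be declared with `argparse`. Only the flag name and its two choices come from the commit message above; the parser description, the default value, and the function name `build_parser` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Convert JoyImage checkpoints (sketch)")
    # One script, two targets: the weight layouts match, only the model class differs.
    parser.add_argument(
        "--model_type",
        choices=["edit", "edit_plus"],
        default="edit",  # assumed default, not confirmed by the PR
        help="Which target model class to instantiate for the converted weights.",
    )
    return parser


# Parsing an explicit argument list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--model_type", "edit_plus"])
```

The conversion code can then branch once on `args.model_type` to pick the target class and share everything else.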
Comment thread src/diffusers/models/transformers/transformer_joyimage.py Outdated
@tangyanf tangyanf requested a review from yiyixuxu July 2, 2026 06:16
@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dg845

dg845 commented Jul 3, 2026

Collaborator

@bot /style

@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

Style bot fixed some files and pushed the changes.

github-actions Bot and others added 6 commits July 3, 2026 01:39
  - Add missing Returns section to JoyImageEditPlusTransformer3DModel.forward docstring
  - Fix alphabetical ordering of dummy classes in dummy_pt_objects.py
   Replace `torch.autocast(dtype=torch.float32)` + `.float()` with
   `.to(self.vae.dtype)` for both VAE encode and decode calls.

   The previous approach caused dtype mismatch (float32 input vs bfloat16
   bias) on CPU where autocast does not automatically cast conv weights,
   breaking the CI `test_layerwise_casting_inference` test.
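
The dtype mismatch described in this commit message can be reproduced in isolation. The sketch below uses a single bfloat16 `Conv2d` as a stand-in for a VAE layer (an assumption; the real VAE is not involved) and shows that casting the input to the module's dtype avoids the error.

```python
import torch

# Stand-in for one conv layer of a VAE whose weights are stored in bfloat16.
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(torch.bfloat16)
image = torch.randn(1, 3, 16, 16)  # float32 input, as produced by preprocessing

# Feeding float32 into bfloat16 weights raises a dtype mismatch error.
try:
    conv(image)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# Casting the input to the module's dtype makes the call well-defined.
out = conv(image.to(conv.weight.dtype))
```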
@tangyanf

tangyanf commented Jul 3, 2026

Author

Hi @yiyixuxu @dg845, all review comments have been addressed and CI is green. Happy to make any further changes if needed. Let me know if anything else is needed on my end for the merge.
Thanks!

@dg845 dg845 left a comment

Collaborator

Thanks for the PR! Left some design comments :).

Comment on lines +118 to +123
if image_rotary_emb is not None:
    vis_freqs, txt_freqs = image_rotary_emb
    if vis_freqs is not None:
        img_query, img_key = _apply_rotary_emb_batched(img_query, img_key, vis_freqs)
    if txt_freqs is not None:
        txt_query, txt_key = _apply_rotary_emb_batched(txt_query, txt_key, txt_freqs)

Collaborator

Since we always set the text RoPE embeddings to None in JoyImageEditPlusTransformer3DModel.forward, I think we can simplify the code here by removing the text RoPE path, so that image_rotary_emb contains only the image RoPE cos and sin components (which would also match the current variable names better).

@register_to_config
def __init__(
    self,
    patch_size: list = [1, 2, 2],

Collaborator

Suggested change
- patch_size: list = [1, 2, 2],
+ patch_size: list[int] = [1, 2, 2],

nit: improve type hint

Comment on lines +387 to +392
self.patch_size = patch_size
self.hidden_size = hidden_size
self.num_attention_heads = num_attention_heads
self.rope_dim_list = rope_dim_list
self.rope_type = rope_type
self.theta = theta

Collaborator

I don't think we need to define these attributes explicitly since the __init__ arguments are already registered to the config. For example, we could use self.config.hidden_size in _get_rotary_pos_embed_for_range without needing to define self.hidden_size here.

Comment on lines +175 to +178
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor_spatial)
self.vae_image_processor = JoyImageEditImageProcessor(
    vae_scale_factor=self.vae_scale_factor_spatial,
)

Collaborator

Suggested change
- self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor_spatial)
- self.vae_image_processor = JoyImageEditImageProcessor(
-     vae_scale_factor=self.vae_scale_factor_spatial,
- )
+ self.image_processor = JoyImageEditImageProcessor(vae_scale_factor=self.vae_scale_factor_spatial)

Only keep the JoyImageEditImageProcessor (since the VaeImageProcessor is unused) and rename it to self.image_processor, which is the standard diffusers name.

    padding = torch.zeros((x.shape[0], padding_length), dtype=x.dtype, device=x.device)
    return torch.cat([x, padding], dim=1)

def normalize_latents(self, latent: torch.Tensor) -> torch.Tensor:

Collaborator

I think we should inline normalize_latents and denormalize_latents, as they are only called once and this would follow the diffusers design better. We should also be able to simplify the code here as AutoencoderKLWan should always have latents_mean and latents_std in its config.
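
As a sketch of what the inlined version could look like, the snippet below normalizes and denormalizes latents with per-channel statistics reshaped for broadcasting over a [B, C, T, H, W] tensor. The four-channel statistics are made-up values standing in for `latents_mean` and `latents_std` from the VAE config, and the exact scaling convention used by the pipeline is an assumption.

```python
import torch

# Illustrative per-channel statistics; real values would come from the VAE config.
latents_mean = torch.tensor([0.1, -0.2, 0.3, 0.0]).view(1, 4, 1, 1, 1)
latents_std = torch.tensor([1.5, 0.8, 2.0, 1.0]).view(1, 4, 1, 1, 1)

latents = torch.randn(2, 4, 1, 8, 8)  # [B, C, T, H, W]

# Inlined normalization (after VAE encode, before denoising).
normalized = (latents - latents_mean) / latents_std

# Inlined denormalization (after denoising, before VAE decode).
restored = normalized * latents_std + latents_mean
```

Because each direction is a single broadcasted expression, keeping them inline at their one call site costs little and removes two helper methods.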


# Prepare latents (patchified)
num_channels_latents = self.transformer.config.in_channels
padded_latents, target_mask, shape_list = self.prepare_latents(

Collaborator

Suggested change
- padded_latents, target_mask, shape_list = self.prepare_latents(
+ latents, target_mask, shape_list = self.prepare_latents(

nit: I think it would be more clear if we use latents throughout.

guidance_scale: float = 4.0,
negative_prompt: str | list[str] | None = None,
generator: torch.Generator | list[torch.Generator] | None = None,
latents: torch.Tensor | None = None,

Collaborator

Would it be possible to support pre-computed latents (e.g. in prepare_latents)? Most of the other pipelines support this.
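
The pattern being asked for usually amounts to an early return when the caller supplies latents. The sketch below is a simplified stand-alone function, not this pipeline's `prepare_latents` (which also returns a mask and shape list and works on patchified latents); the signature and shapes are assumptions.

```python
import torch


def prepare_latents(shape, dtype, device, generator=None, latents=None):
    if latents is not None:
        # Caller supplied latents: only move them to the requested device and dtype.
        return latents.to(device=device, dtype=dtype)
    # Otherwise sample fresh Gaussian noise, optionally from a seeded generator.
    return torch.randn(shape, generator=generator, device=device, dtype=dtype)


device = torch.device("cpu")
shape = (1, 16, 1, 8, 8)

sampled = prepare_latents(shape, torch.float32, device, generator=torch.Generator().manual_seed(0))
reused = prepare_latents(shape, torch.float32, device, latents=sampled)
```

Passing the same latents back in gives reproducible starting noise across calls without depending on the generator state.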

Comment on lines +623 to +626
# Zero out padding text tokens to prevent them from corrupting attention
# (original uses explicit attention masking; here we neutralize padding values)
if prompt_embeds_mask is not None:
    prompt_embeds = prompt_embeds * prompt_embeds_mask.unsqueeze(-1)

Collaborator

Why can't we use an attention mask here?
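
One reason the distinction matters: zeroing padded tokens and masking them are not equivalent inside softmax attention. The sketch below is a simplified illustration that zeroes keys and values directly (in the real model the zeroed embeddings still pass through projections first); the shapes and the 3-valid-plus-2-padding layout are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 1, 3, 8)  # [batch, heads, query_len, head_dim]
k = torch.randn(1, 1, 5, 8)  # 5 text tokens, the last 2 are padding
v = torch.randn(1, 1, 5, 8)
valid = torch.tensor([True, True, True, False, False])

# Ground truth: attention over the 3 real tokens only.
reference = F.scaled_dot_product_attention(q, k[:, :, :3], v[:, :, :3])

# Masking: padded keys receive zero attention weight.
masked = F.scaled_dot_product_attention(q, k, v, attn_mask=valid.view(1, 1, 1, 5))

# Zeroing: padded keys score 0, so softmax still gives each of them exp(0) weight.
keep = valid.view(1, 1, 5, 1).to(k.dtype)
zeroed = F.scaled_dot_product_attention(q, k * keep, v * keep)
```

Here `masked` matches `reference`, while `zeroed` is diluted because the padded positions still take part in the softmax normalization.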

# img_tensor: [C, T, H, W] -> [C, H, W] (T=1)
img_tensor = img_tensor[:, 0]
if output_type == "pil":
    img_np = (img_tensor.permute(1, 2, 0).cpu().float().numpy() * 255).clip(0, 255).astype(np.uint8)

Collaborator

Would it be possible to use JoyImageEditImageProcessor.postprocess to postprocess the decoded VAE outputs? postprocess should be able to handle the output_type logic as well.

Comment on lines +735 to +736
if not return_dict:
    return image

Collaborator

Suggested change
- if not return_dict:
-     return image
+ if not return_dict:
+     return (image,)

nit: if return_dict=False, we usually return a tuple, even if it only has one element. See for example Flux 2:

if not return_dict:
    return (image,)

Labels

  • documentation: Improvements or additions to documentation
  • fixes-issue
  • models
  • no-issue-needed: for PRs that do not require link to an issue
  • pipelines
  • size/L: PR with diff > 200 LOC
  • tests
  • utils

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add JoyAI-Image Edit Plus pipeline and model

4 participants